Abstract


Convolutional neural networks with many layers have recently been shown to
achieve excellent results on many high-level tasks such as image classification,
object detection and, more recently, semantic segmentation. Particularly for
semantic segmentation, a two-stage procedure is often employed: convolutional
networks are trained to provide good local pixel-wise features for a second
step, traditionally a more global graphical model. In this work we unify this
two-stage process into a single joint training algorithm. We demonstrate our
method on the semantic image segmentation task and show encouraging results on
the challenging PASCAL VOC 2012 dataset.

1 Introduction 

In the past few years, Convolutional Neural Networks (CNNs) have revolutionized computer vision.
They have been shown to achieve state-of-the-art performance in a variety of vision problems,
including image classification [19, 31], object detection, human pose estimation [32], stereo [36],
and caption generation. This is mainly due to their high representational power achieved by
learning complex, non-linear dependencies.

Only very recently have convolutional nets also proven very effective for semantic segmentation.
This is perhaps due to the fact that, to achieve invariance, pooling operations are performed,
which often reduce the dimensionality of the prediction. A Markov random field (MRF) is then used
as a refinement step in order to obtain segmentations that respect segment boundaries well. The
seminal work of Krähenbühl and Koltun showed that inference in fully connected MRFs is possible
if the smoothness potentials are Gaussian. Impressive performance was demonstrated in semantic
segmentation with hand-crafted features. Later, [3] extended the unary potentials to incorporate
convolutional network features. However, these current approaches train the segmentation models
in a piece-wise fashion, fixing the unary weights while learning the parameters of the pairwise
terms which enforce smoothness.

In this paper we present an algorithm that jointly trains the parameters of the convolutional
network defining the unary potentials as well as the smoothness terms, taking into account the
dependencies between the random variables. We demonstrate the effectiveness of our approach using
the dataset of the PASCAL VOC 2012 challenge [9].

2 Background 

We begin by describing how to learn probabilistic deep networks which take into account
correlations between multiple output variables $y = (y_1, \ldots, y_N)$ that are of interest to us.
Moreover, a valid configuration $y \in \mathcal{Y} = \prod_{i=1}^N \mathcal{Y}_i$ is assumed to lie in the
product space of the discrete variable domains $\mathcal{Y}_i = \{1, \ldots, |\mathcal{Y}_i|\}$.

For a given data sample $x \in \mathcal{X}$ and a parameter vector $w \in \mathbb{R}^A$, the score $F$ of a
configuration $y \in \mathcal{Y}$ is generally modeled by the mapping
$F : \mathcal{X} \times \mathcal{Y} \times \mathbb{R}^A \to \mathbb{R}$.
Algorithm: Deep Learning

Repeat until stopping criteria

1. Forward pass to compute $F(x, y; w)$ $\forall y \in \mathcal{Y}$

2. Normalization via soft-max to obtain $p(y \mid x, w)$

3. Backward pass through definition of function via chain rule

4. Parameter update

Figure 1: Gradient descent for learning deep models.


The prediction task amounts to finding the configuration

$$y^* = \arg\max_{y \in \mathcal{Y}} F(x, y; w), \qquad (1)$$

which maximizes the score $F(x, y; w)$. Note that the best scoring configuration $y^*$ is equivalently
given as the maximizer of the probability distribution

$$p(y \mid x, w) \propto \exp F(x, y; w),$$

since the exponential function is monotonically increasing and the normalization constant does
not depend on the configuration $y \in \mathcal{Y}$.
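The equivalence between the best scoring configuration and the most probable one can be checked by brute force on a tiny output space. The following is a minimal sketch; the toy scoring function `score`, its weights and the three-variable input are hypothetical and only stand in for a general $F(x, y; w)$.

```python
import itertools
import math

def score(x, y, w):
    """Toy stand-in for F(x, y; w): unary terms plus a chain smoothness bonus."""
    unary = sum(w["unary"] * x[i][y_i] for i, y_i in enumerate(y))
    pairwise = sum(w["smooth"] for a, b in zip(y, y[1:]) if a == b)
    return unary + pairwise

def brute_force_inference(x, w, num_labels):
    """Enumerate the whole product space and return (y*, p(. | x, w))."""
    configs = list(itertools.product(range(num_labels), repeat=len(x)))
    scores = [score(x, y, w) for y in configs]
    max_score = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    unnorm = [math.exp(s - max_score) for s in scores]
    z = sum(unnorm)  # normalization constant, identical for every y
    probs = {y: u / z for y, u in zip(configs, unnorm)}
    y_star = configs[scores.index(max_score)]
    return y_star, probs

x = [[2.0, 0.0], [0.4, 0.6], [1.5, 0.1]]  # three variables, two labels each
w = {"unary": 1.0, "smooth": 0.5}
y_star, probs = brute_force_inference(x, w, num_labels=2)
```

Here the maximizer of the score and the maximizer of the normalized distribution coincide, since dividing by the constant `z` does not change the ordering of configurations.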

The learning task is concerned with finding a parameter vector

$$w^* = \arg\max_{w} \prod_{(x, y) \in \mathcal{D}} p(y \mid x, w), \qquad (2)$$

which maximizes the likelihood of a given training set $\mathcal{D} = \{(x, y)\}$. The training set consists
of input-output pairs $(x, y)$ which are assumed to be independent and identically distributed. Note
that maximizing the likelihood is equivalent to minimizing the cross entropy between the modeled
distribution $p(y \mid x, w)$ and a target distribution which places all its mass on the ground-truth
configuration $y$. Throughout this work we make no further assumptions about the dependence of the
scoring function $F(x, y; w)$ on the parameter vector $w$, i.e., $F(x, y; w)$ is generally neither convex
nor smooth.

For problems where the output-space size $|\mathcal{Y}| = \prod_{i=1}^N |\mathcal{Y}_i|$ is in the thousands, we can
exactly solve the inference task given in Eq. (1) by searching over all possible output space
configurations $y \in \mathcal{Y}$. In such a setting, the different configurations are typically referred to
as different classes. Similarly, we normalize the distribution $p(y \mid x, w)$ by summing up the
exponentiated score $\exp F(x, y; w)$ over all possibilities $y \in \mathcal{Y}$. This is often referred to as a
soft-max computation. Non-convexity and non-smoothness of the learning objective w.r.t. the
parameters $w$ are addressed with stochastic gradient ascent. For efficiency, the gradient is often
computed on a small subset of the training data, i.e., a mini-batch.

We summarize the resulting training algorithm in Fig. 1. On a high level it consists of four steps
which are iterated until a stopping criterion is met: (i) a forward pass to compute the scoring
function $F(x, y; w)$ for all output space configurations $y \in \mathcal{Y}$; (ii) normalization of the scoring
function via a soft-max computation to obtain the probability distribution $p(y \mid x, w)$; (iii)
computation and back-propagation of the gradient of the loss function, often the log-likelihood or
equivalently the cross-entropy; and (iv) an update of the parameters.
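The four steps can be sketched for a tiny multi-class problem as follows. This is only an illustration under simplifying assumptions: the scoring function is a hypothetical linear model (so the backward pass is a single chain-rule step), the data set is made up, and every mini-batch contains one sample.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def forward(x, w):
    """Step (i): score F(x, y; w) for every class y (toy linear model)."""
    return [sum(w_k * x_k for w_k, x_k in zip(w_y, x)) for w_y in w]

def train(data, num_classes, num_features, lr=0.5, epochs=200):
    w = [[0.0] * num_features for _ in range(num_classes)]
    for _ in range(epochs):
        for x, y in data:                       # mini-batch of size one
            p = softmax(forward(x, w))          # steps (i) and (ii)
            for c in range(num_classes):        # step (iii)
                # d log p(y | x, w) / d F(x, c; w) = [c == y] - p(c | x, w)
                g = (1.0 if c == y else 0.0) - p[c]
                for k in range(num_features):   # step (iv): gradient ascent
                    w[c][k] += lr * g * x[k]
    return w

data = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 1.0], 2)]
w = train(data, num_classes=3, num_features=2)
predictions = [max(range(3), key=lambda c: forward(x, w)[c]) for x, _ in data]
```

For a deep network, step (iii) would propagate the same soft-max gradient further back through all layers instead of stopping at a single linear map.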

However, solving the inference task given in Eq. (1) or the learning problem stated in Eq. (2) is
computationally challenging if we consider more complex output spaces $\mathcal{Y}$, e.g., those arising from
tasks like image tagging. The situation is even more severe if we target image segmentation, where
the exponential number of possible output space configurations prevents even storage of $F(x, y; w)$
$\forall y \in \mathcal{Y}$. Note that this is required in the first line of the algorithm summarized in Fig. 1.
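To get a feeling for the growth of the output space, the following short sketch computes the base-10 logarithm of the number of configurations. The image size of 100 x 100 pixels is a hypothetical example chosen for illustration; the 21 labels correspond to the segmentation task considered later.

```python
import math

def log10_num_configurations(num_variables, num_labels):
    """log10 of the output-space size num_labels ** num_variables."""
    return num_variables * math.log10(num_labels)

# Ten binary variables: 1024 configurations, easily enumerated.
small = log10_num_configurations(10, 2)

# A hypothetical 100 x 100 pixel image with 21 labels per pixel.
large = log10_num_configurations(100 * 100, 21)
```

The second number is above 13000, i.e., the number of configurations has more than thirteen thousand decimal digits, which rules out any explicit table of scores.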

Given an exponential number of possible configurations $|\mathcal{Y}| = \prod_{i=1}^N |\mathcal{Y}_i|$, how do we
represent the scoring function $F(x, y; w)$ efficiently? Assuming we have an efficient representation,
how can we effectively normalize the probability $p(y \mid x, w)$? One possible answer to those questions
was given by Chen et al. [4], who discussed extending log-linear models, i.e., those with a scoring
function of the form $F(x, y; w) = w^\top \phi(x, y)$, to the more general setting, i.e., an arbitrary
dependence of the scoring function $F(x, y; w)$ on the parameter vector $w$.

Algorithm: Learning Deep Structured Models

Repeat until stopping criteria

1. Forward pass to compute $f_r(x, y_r; w)$ $\forall r \in \mathcal{R}, y_r \in \mathcal{Y}_r$

2. Computation of marginals $b_{(x,y),r}(y_r)$ via loopy belief propagation, convex belief
propagation or tree-reweighted message passing

3. Backward pass through definition of function via chain rule

4. Parameter update

Figure 2: Approximated gradient descent for learning deep structured models.

In short, Chen et al. [4] assumed the global scoring function $F(x, y; w)$ to decompose into a sum of
local scoring functions $f_r$, each depending on a small subset $r \subseteq \{1, \ldots, N\}$ of variables
$y_r = (y_i)_{i \in r}$. All restrictions $r$ required to compute the global function via

$$F(x, y; w) = \sum_{r \in \mathcal{R}} f_r(x, y_r; w) \qquad (3)$$

are subsumed in the set $\mathcal{R}$. If the size of each and every local restriction set $r \in \mathcal{R}$ is small,
$F(x, y; w)$ is efficiently representable.
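The benefit of the decomposition can be made concrete by counting table entries. The sketch below uses a hypothetical chain of six variables with three labels each and made-up local functions; it only illustrates that the global score is a sum of local terms and that storing the local tables is far cheaper than storing one global table.

```python
def global_score(y, local_functions):
    """Sum of local scores f_r(y_r) over all restrictions r."""
    return sum(f(tuple(y[i] for i in r)) for r, f in local_functions.items())

def storage_cost(local_functions, num_labels):
    """Number of table entries needed to store every local function explicitly."""
    return sum(num_labels ** len(r) for r in local_functions)

num_vars, num_labels = 6, 3
local_functions = {}
for i in range(num_vars):
    # Hypothetical unary term: reward label (i mod num_labels) at variable i.
    local_functions[(i,)] = lambda y_r, i=i: 1.0 if y_r[0] == i % num_labels else 0.0
for i in range(num_vars - 1):
    # Hypothetical pairwise term: reward equal neighboring labels.
    local_functions[(i, i + 1)] = lambda y_r: 0.5 if y_r[0] == y_r[1] else 0.0

y = (0, 1, 2, 0, 1, 1)
total = global_score(y, local_functions)
local_entries = storage_cost(local_functions, num_labels)
global_entries = num_labels ** num_vars
```

In this example the local representation needs 63 entries, whereas a global table over all configurations needs 729.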

To compute the gradient of the log-likelihood cost function, we require a properly normalized
distribution $p(y \mid x, w)$, or more specifically its marginals $b_{(x,y),r}(y_r)$ for each restriction
$r \in \mathcal{R}$. To this end, message passing type algorithms were employed by [4]. Such an approach is
exact if the distribution $p(y \mid x, w)$ is of low tree-width. Otherwise the computational complexity is
prohibitively large, and approximations like loopy belief propagation [26], convex belief
propagation [39] or tree-reweighted message passing [37] are alternatives that were successfully applied.

The resulting iterative method of [4] is summarized in Fig. 2. In a first step the forward pass
computes all outputs of every local scoring function. Afterwards (approximate) marginals are obtained
in a second step, and utilized to compute the derivative of the (approximated) maximum likelihood
cost function w.r.t. the parameters $w$. The following backward pass computes the gradient w.r.t. the
parameters by repeatedly applying the chain rule according to the definition of the scoring function
$F(x, y; w)$. The gradient is then utilized during the final parameter update.

The approach presented by [4] not only fails if the decomposition assumed in Eq. (3) is not
available; it is also computationally challenging to obtain the required marginals if too many
local functions are required. That is, computation is slow if the number of restrictions $|\mathcal{R}|$ is
large, e.g., when working with densely connected image segmentation models where every pixel is
possibly correlated with every other pixel in the image.

3 Approach 

Densely connected models were previously considered by [33, 34, 18] and shown to yield
impressive results for the image segmentation task. Learning the parameters of densely connected
models was considered by Krähenbühl and Koltun [18] in the context of the log-linear setting.
Following [4], we aim at extending those fully connected log-linear models to the more general setting
of an arbitrary function $F(x, y; w)$, e.g., a deep convolutional neural network. Note that a similar
approach has recently been discussed in independent work.

Let us consider within this section how to efficiently combine deep structured prediction [4] with
densely connected probabilistic models. Before getting into the details we note
that the presented approach trades the computational complexity of the general method of [4] for a
restriction on the pairwise functions $f_{ij}$ (i.e., $r = \{i, j\}$). Concretely, the local functions
$f_{ij}$ are assumed to be mixtures of kernels in a feature space as detailed below. For simplicity we
assume that local functions of order higher than two are not required to represent our global scoring
function $F(x, y; w)$. Generalizations have however been presented, e.g., by Vineet et al. [34].
3.1 Inference


We begin our discussion by considering the inference task. To obtain a computationally efficient
prediction algorithm we use a mean field approximation $q_{(x,y)}(y)$ of the model distribution
$p(y \mid x, w)$ for every sample $(x, y)$. More formally, we assume our approximation to factor according
to $q_{(x,y)}(y) = \prod_{i=1}^N q_{(x,y),i}(y_i)$. Given some parameters $w$, we employ a forward pass to obtain
our local function representations $f_r(x, y_r; w)$. Next we compute the single variable marginals
$q_{(x,y),i}(y_i)$ by minimizing the Kullback-Leibler (KL) divergence w.r.t. the assumed factorization of
the mean field distribution $q_{(x,y)}(y)$, i.e.,

$$q^*_{(x,y)} = \arg\min_{q \in \Delta} D_{KL}\big(q_{(x,y)}(y) \,\|\, p(y \mid x, w)\big). \qquad (4)$$

Hereby $q \in \Delta$ requires $q$ to be a valid probability distribution. Due to non-convexity, only
convergence to a stationary point of the KL divergence cost function is guaranteed for sequential
block-coordinate updates [38, 16]. More precisely, iterating until convergence through the variables
$i \in \{1, \ldots, N\}$ using the closed form update

$$q_{(x,y),i}(y_i) \propto \exp\Big( f_i(y_i, x, w) + \sum_{j \in \mathcal{N}(i),\, y_j} f_{ij}(y_i, y_j, x, w)\, q_{(x,y),j}(y_j) \Big), \qquad (5)$$

which assumes all marginals but $q_{(x,y),i}$ to be fixed, retrieves a stationary point for the cost function
of the program given in Eq. (4). The set of variables neighboring $i$ is denoted $\mathcal{N}(i)$.
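The sequential update can be sketched directly for a small graph. The code below is a naive illustration, not the efficient filtering-based variant discussed next; the data layout (`unary`, `pairwise`, `neighbors`), the uniform initialization and the fixed number of sweeps are assumptions made for this example.

```python
import math

def mean_field(unary, pairwise, neighbors, num_iters=20):
    """Sequential mean-field updates in the spirit of Eq. (5).

    unary[i][a]            -- unary score of label a at variable i
    pairwise[(i, j)][a][b] -- pairwise score of (y_i = a, y_j = b), stored for i < j
    neighbors[i]           -- variables adjacent to i
    """
    n, k = len(unary), len(unary[0])
    q = [[1.0 / k] * k for _ in range(n)]  # uniform initialization
    for _ in range(num_iters):
        for i in range(n):
            logits = []
            for a in range(k):
                s = unary[i][a]
                for j in neighbors[i]:
                    for b in range(k):
                        f = pairwise[(i, j)][a][b] if i < j else pairwise[(j, i)][b][a]
                        s += f * q[j][b]  # expectation under the neighbor's marginal
                logits.append(s)
            m = max(logits)
            exps = [math.exp(v - m) for v in logits]
            z = sum(exps)
            q[i] = [e / z for e in exps]  # all other marginals stay fixed
    return q

potts = [[1.0, 0.0], [0.0, 1.0]]
unary = [[2.0, 0.0], [0.0, 0.0], [0.5, 0.0]]  # chain of three variables
pairwise = {(0, 1): potts, (1, 2): potts}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
q = mean_field(unary, pairwise, neighbors)
```

In this toy chain the middle variable has no unary preference, yet its marginal leans towards label 0 because both neighbors do.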

In the case of densely connected variables, the computational bottleneck arises from the second
summand, which sums over all neighbors $j \in \mathcal{N}(i)$ and their states $y_j$. The sum ranges over
$|\mathcal{N}(i)| = N - 1$ neighbors for densely connected structured models. Hence the complexity of an
update for a single marginal is $O(N)$, and updating all $N$ marginals therefore requires $O(N^2)$
operations, as also discussed by Krähenbühl and Koltun [18].

Importantly, Krähenbühl and Koltun observed that a high dimensional Gaussian filter can be
applied to concurrently update all marginals in $O(N)$. This is achievable when constraining
ourselves to pairwise functions being mixtures of $M$ kernels in the feature space as mentioned before.
Formally, we require

$$f_{ij}(y_i, y_j, x, w) = \sum_{m=1}^{M} \mu^{(m)}(y_i, y_j, w)\, k^{(m)}\big(f_i(x) - f_j(x)\big),$$

where $\mu^{(m)}$ is a label compatibility function, $k^{(m)}$ is a kernel function, and $f_i(x)$ are features of
variable $i$ depending on the data $x$.
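With kernel-based pairwise functions, the expensive part of the update is a kernel-weighted sum of the neighbors' marginals, which is exactly what a filter computes. The sketch below evaluates this sum naively in quadratic time to make the structure explicit; it does not implement the fast high dimensional filtering itself. The Gaussian kernel with diagonal inverse covariance `inv_var` and the one-dimensional toy features are assumptions for illustration.

```python
import math

def gaussian_kernel(f_i, f_j, inv_var):
    """Gaussian kernel on the feature difference, diagonal inverse covariance."""
    return math.exp(-0.5 * sum(v * (a - b) ** 2 for a, b, v in zip(f_i, f_j, inv_var)))

def naive_filter(features, q, inv_var):
    """out[i][l] = sum over j != i of k(f_i - f_j) * q[j][l], in O(N^2)."""
    n, num_labels = len(q), len(q[0])
    out = [[0.0] * num_labels for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            weight = gaussian_kernel(features[i], features[j], inv_var)
            for l in range(num_labels):
                out[i][l] += weight * q[j][l]
    return out

features = [(0.0,), (1.0,), (10.0,)]      # hypothetical 1-D positions
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # current marginals
messages = naive_filter(features, q, inv_var=[1.0])
```

The nearby variable dominates the message received by variable 0, while the distant one contributes almost nothing; fast filtering approximates the same sums for all variables at once.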

However, to ensure convergence to a stationary point of the KL divergence cost function for this
parallel update, further restrictions on the form of the pairwise functions $f_{ij}$ apply. Formally, if the
label compatibility functions $\mu^{(m)}$ are negative semi-definite $\forall m$, and the kernels $k^{(m)}$ are positive
definite $\forall m$, the KL divergence is readily given as the difference between a concave and a convex
term. Hence the concave-convex procedure (CCCP) [40] is directly applicable. We therefore
proceed iteratively by first linearizing the concave term at the current location and second
minimizing the resulting linearized but convex program.

As detailed by Krähenbühl and Koltun [18], and as discussed above, finding the linearization is
equivalently solved via filtering in time linear in $N$. Solving the convex program in its original form
requires solving a non-linear system of equations independently for each marginal $q_{(x,y),i}(y_i)$
via Newton's method. A further approximation to the cross-entropy term of the KL divergence
relates the efficient filtering based mean field update of the marginals $q_{(x,y),i}(y_i)$ to the corresponding
cost function for which a stationary point is found.

3.2 Learning 

Having observed that mean-field inference can be efficiently addressed with Gaussian filtering, given
restrictions on the pairwise functions $f_{ij}$, we now turn our attention to the learning task. As
mentioned before we aim at finding a parameter vector $w$ that maximizes the likelihood objective
function. Since the exact likelihood is computationally expensive, we use the log-likelihood based on
the mean-field marginals. Hence our surrogate loss function $L_{(x,y)}$ for a sample $(x, y)$ with
corresponding annotated ground truth labeling $\hat{y}$ is given by

$$L_{(x,y)}(q_{(x,y)}) = -\sum_{i=1}^{N} \log q_{(x,y),i}(\hat{y}_i). \qquad (6)$$

Algorithm: Learning Fully Connected Deep Structured Models

Repeat until stopping criteria

1. Forward pass to compute $f_r(x, y_r; w)$ $\forall r \in \mathcal{R}, y_r \in \mathcal{Y}_r$

2. Computation of marginals $q^t_{(x,y),i}(y_i)$ via filtering for $t \in \{1, \ldots, T\}$

3. Backtracking through the marginals $q^t_{(x,y),i}(y_i)$ from $t = T - 1$ down to $t = 1$

4. Backward pass through definition of function via chain rule

5. Parameter update

Figure 3: Stochastic gradient descent for learning fully connected deep structured models.


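To perform a parameter update step we need the gradient of the surrogate loss function w.r.t. the
parameters, i.e.,

$$\frac{\partial L_{(x,y)}}{\partial w} = \frac{\partial L_{(x,y)}}{\partial q_{(x,y)}} \frac{\partial q_{(x,y)}}{\partial w}. \qquad (7)$$

The gradient of the surrogate loss function $L_{(x,y)}$ w.r.t. the marginals is easily obtained from Eq. (6).
It is given by

$$\frac{\partial L_{(x,y)}}{\partial q_{(x,y),i}(y_i)} = -\frac{1}{q_{(x,y),i}(y_i)} [y_i = \hat{y}_i], \qquad (8)$$

where the Iverson bracket $[y_i = \hat{y}_i]$ equals one if $y_i$ equals the ground-truth label $\hat{y}_i$, and
zero otherwise.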
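The gradient w.r.t. the marginals can be checked numerically. The sketch below assumes that the surrogate loss is the negative sum of log-marginals at the ground-truth labels; the marginals `q` and labels `y_hat` are made-up values. Only the partial derivative w.r.t. a single entry is tested, so the perturbed marginal is deliberately not re-normalized.

```python
import math

def surrogate_loss(q, y_hat):
    """Negative sum of log-marginals evaluated at the ground-truth labels."""
    return -sum(math.log(q[i][y_hat[i]]) for i in range(len(q)))

def loss_gradient(q, y_hat):
    """Partial derivatives: -1 / q_i(a) if a is the ground-truth label, else 0."""
    return [[-1.0 / q_i[a] if a == y_hat[i] else 0.0 for a in range(len(q_i))]
            for i, q_i in enumerate(q)]

q = [[0.7, 0.3], [0.2, 0.8]]  # hypothetical marginals of two variables
y_hat = [0, 1]                # hypothetical ground-truth labels
grad = loss_gradient(q, y_hat)

# Finite-difference check of the entry for variable 0, label 0.
eps = 1e-6
q_shift = [[0.7 + eps, 0.3], [0.2, 0.8]]
numeric = (surrogate_loss(q_shift, y_hat) - surrogate_loss(q, y_hat)) / eps
```

Entries belonging to labels other than the ground truth receive a zero gradient, so only one entry per variable contributes to the backward pass.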


To perform a gradient step during learning, we additionally require the derivatives of the marginals
w.r.t. the parameters, i.e., $\partial q_{(x,y),i}(y_i) / \partial w$.

More carefully investigating the mean-field update given in Eq. (5) reveals a recursive definition.
More concretely, the derivative $\partial q^t_{(x,y),i}(y_i) / \partial w$ of the marginal $q^t_{(x,y),i}(y_i)$ after $t$
iterations depends on the results from earlier iterations. Hence, we obtain the desired result by
successively back-tracking through the mean-field iterations from the last iteration back to the first.
This direct computation is however computationally expensive. Fortunately, back-substitution into the
loss gradient yields an algorithm which requires a total of $T$ back-tracking steps, independent of the
number of parameters. We refer the interested reader to [18] for additional details regarding the
computation of the gradient $\partial q_{(x,y),i}(y_i) / \partial w$.

In contrast to [18], however, we no longer assume the unaries to be given by a logistic regression model.
In contrast to [3], we do not assume the unaries to be fixed during CRF parameter updates.
Generalizing the gradient of the marginals w.r.t. parameters to arbitrary unaries is straightforward since
the gradients are directly given by the marginals. Combined with the gradient of the log-likelihood
loss function w.r.t. the marginals, given in Eq. (8), we obtain the derivative of the loss $L_{(x,y)}$ w.r.t.
the unary scores as the difference between the ground-truth and the predicted marginals. This result
is then used for back-propagation through any functional structure which provides the unary scoring
functions $f_i$, e.g., convolutional neural networks.


Derivatives w.r.t. label compatibility and kernel shape parameters are readily given in [18]. The
resulting algorithm is summarized in Fig. 3. In short, we first obtain our functional
representation via a forward pass through any functional network. Subsequently we compute our
mean-field marginals via filtering. Afterwards we obtain the gradient of the loss function via efficient
back-tracking. In the next step the gradient w.r.t. the parameters is computed by back-propagating the
gradient of the loss function using the chain rule dictated by the definition of the scoring function.
In a final step we update the parameters.
Figure 4: (a) Validation set performance over the number of iterations when fine-tuning the unary
parameters only. (b) Validation set performance over the number of iterations when fine-tuning all
parameters.


4 Experiments 

We evaluate our approach summarized in Fig. 3 on the dataset of the Pascal VOC 2012 challenge [9].
The task is semantic image segmentation of 21 object classes (including background). The original
dataset contains 1464 training, 1449 validation and 1456 test images. In addition to this data we
make use of the annotations provided by Hariharan et al., resulting in a total of 10582 training
instances. The reported performance is measured using the intersection-over-union metric. Note
that we conduct our tests on the 1449 validation set images, which were used neither for training
nor for fine-tuning.
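For reference, the intersection-over-union measure of a class is the number of pixels where prediction and ground truth both carry that class, divided by the number of pixels where at least one of them does. The sketch below computes it on flat label lists; it is a simplified illustration that accumulates over the given pixels only and does not handle the void label used in the benchmark annotations.

```python
def per_class_iou(prediction, ground_truth, num_classes):
    """Intersection over union per class; None for classes that never occur."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, g in zip(prediction, ground_truth):
        if p == g:
            intersection[p] += 1
            union[p] += 1
        else:
            union[p] += 1  # predicted as p but labeled g
            union[g] += 1  # labeled g but predicted as p
    return [intersection[c] / union[c] if union[c] else None
            for c in range(num_classes)]

def mean_iou(prediction, ground_truth, num_classes):
    ious = [v for v in per_class_iou(prediction, ground_truth, num_classes)
            if v is not None]
    return sum(ious) / len(ious)

prediction = [0, 0, 1, 1, 2, 2]    # hypothetical per-pixel predictions
ground_truth = [0, 1, 1, 1, 2, 0]  # hypothetical per-pixel annotations
ious = per_class_iou(prediction, ground_truth, num_classes=3)
miou = mean_iou(prediction, ground_truth, num_classes=3)
```

Because every misclassified pixel enlarges the union of two classes, the measure penalizes both missed and spurious regions.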

4.1 Model 

Our model setup follows [3], i.e., we employ the 16 layer DeepNet model. Just like [3] we first
convert the fully connected layers into convolutions as discussed in [30]. This is useful since
we are not interested in a single variable output prediction, but rather aim at learning probability
masks. To obtain a larger probability mask we skip downsampling during the last two max-pooling
operations. To take into account the skipped downsampling during subsequent convolutions we
employ the 'à trous (with hole) algorithm' [23]. It takes care of the fact that data is stored in an
interleaved way, i.e., in our case convolutions sub-sample the input data by a factor of two or four
respectively. To adapt to the 21 object classes we also replace the top layer of the DeepNet model to
yield 21 classes for each pixel.
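The idea behind convolutions "with holes" can be illustrated in one dimension. In the sketch below, a filter is applied with gaps of `rate - 1` samples between its taps; evaluating it on the full-resolution signal reproduces, at every second position, what the dense filter would compute on the signal sub-sampled by two. The function name, the signal and the filter are hypothetical.

```python
def atrous_conv1d(signal, kernel, rate):
    """Valid 1-D correlation with `rate - 1` skipped samples between kernel taps."""
    span = (len(kernel) - 1) * rate
    return [sum(k * signal[i + t * rate] for t, k in enumerate(kernel))
            for i in range(len(signal) - span)]

signal = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
kernel = [1, 0, -1]

# Dense filter on the signal sub-sampled by a factor of two.
on_subsampled = atrous_conv1d(signal[::2], kernel, rate=1)

# Filter with holes on the full-resolution signal.
with_holes = atrous_conv1d(signal, kernel, rate=2)
```

The even positions of `with_holes` match `on_subsampled`, while the odd positions provide the additional outputs that would have been lost by downsampling.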

Similar to [3] we assume the input size of our network to be of dimension 306 × 306, which results
in a 40 × 40 sized spatial output of the DeepNet; in our case this is, however, only an intermediate result.

Contrasting [3], we jointly optimize for both unary and CRF parameters using the algorithm
presented in Fig. 3. To this end, given images downsampled to a size of 306 × 306, our algorithm first
performs a forward pass through the convolutional DeepNet to obtain the 40 × 40 × 21 sized class
probability maps in an intermediate stage. These intermediate class probability maps are directly
up-sampled to the original image dimension using a bi-linear interpolation layer. This yields the
actual output of our augmented DeepNet network defining the scoring function $F(x, y; w)$. Note
that the number $N$ of variables $y = (y_1, \ldots, y_N) \in \mathcal{Y}$ is therefore equal to the number of pixels of
the original image.
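A bi-linear up-sampling step for a single class map can be sketched as follows. The corner-aligned coordinate mapping is an assumption made for this example (other conventions exist), and the function expects input and output sizes of at least two in each dimension.

```python
def bilinear_upsample(grid, out_h, out_w):
    """Resize a 2-D score map with bi-linear interpolation (corners aligned)."""
    in_h, in_w = len(grid), len(grid[0])
    out = [[0.0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        y = r * (in_h - 1) / (out_h - 1)  # source row coordinate
        y0 = min(int(y), in_h - 2)
        dy = y - y0
        for c in range(out_w):
            x = c * (in_w - 1) / (out_w - 1)  # source column coordinate
            x0 = min(int(x), in_w - 2)
            dx = x - x0
            top = (1 - dx) * grid[y0][x0] + dx * grid[y0][x0 + 1]
            bottom = (1 - dx) * grid[y0 + 1][x0] + dx * grid[y0 + 1][x0 + 1]
            out[r][c] = (1 - dy) * top + dy * bottom
    return out

coarse = [[0.0, 2.0], [4.0, 6.0]]  # hypothetical 2 x 2 score map
fine = bilinear_upsample(coarse, 3, 3)
```

Since the operation is linear in the input map, gradients w.r.t. the up-sampled scores can be passed back to the coarse map with the same interpolation weights.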

For the second step of our algorithm we perform 5 iterations of mean field updates to compute
the marginals $q_{(x,y),i}(y_i)$ of the fully connected CRF. Those are then compared to the original
ground-truth image segmentations, using as our loss function the sum of cross-entropy terms, i.e.,
the log-likelihood loss, as specified in Eq. (6). In the third step we back-track through the marginals
to obtain a gradient of the loss function. Afterwards we back-propagate the derivatives w.r.t. the
unary term through both the bi-linear interpolation and the 16-layer convolutional network. The
shape and compatibility parameters of the CRF, detailed below, are updated directly.
Data    bkg     aero    bike    bird    boat    bottle  bus     car     cat     chair   cow
Valid.  90.461  77.455  30.355  76.564  60.735  65.075  81.261  74.958  81.505  23.367  66.279
Train   90.159  76.314  64.450  78.677  68.224  68.044  84.491  80.274  86.347  44.567  79.987

Data    table   dog     horse   mbike   person  plant   sheep   sofa    train   tv      Our mean  [3]
Valid.  52.219  70.624  66.660  65.725  72.913  42.174  73.452  43.412  71.738  58.322  64.060    63.74
Train   62.710  82.987  76.729  76.523  75.399  63.863  79.937  55.146  80.699  70.164  73.604    -

Table 1: Performance (intersection over union, in %) of our approach for individual classes. In the
last two columns of the lower panel we compare our mean to the recently presented baseline by
Chen et al. [3].


It was shown independently by many authors that successively increasing the number of
parameters during training typically yields better performance due to better initialization of larger
models. We therefore train our model in two stages. First, we assume no pairwise connections to be
present, i.e., we fine-tune the weights obtained from the DeepNet ImageNet model [31, 29] to the
Pascal dataset [9]. We employ standard parameter settings: a momentum of 0.9, a weight decay of
0.0005, and learning rates of 0.01 for the top layer and 0.001 for all other layers. Due
to the 12GB memory restriction of the Tesla K40 GPU we use a mini-batch size of 20 images.
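The role of the three hyper-parameters can be sketched with a single update step. The exact update rule used for the experiments is not spelled out in the text; the version below is one common convention for momentum with L2 weight decay and should be read as an assumption. The parameter and gradient values are made up.

```python
def sgd_momentum_step(params, grads, velocity, lr, momentum=0.9, weight_decay=0.0005):
    """One in-place step of SGD with momentum and L2 weight decay (assumed form)."""
    for k in range(len(params)):
        g = grads[k] + weight_decay * params[k]  # weight decay acts like an extra gradient
        velocity[k] = momentum * velocity[k] - lr * g
        params[k] += velocity[k]

# Separate learning rates for the top layer and all other layers.
top_layer, top_velocity = [1.0], [0.0]
other_layers, other_velocity = [1.0], [0.0]
sgd_momentum_step(top_layer, [2.0], top_velocity, lr=0.01)
sgd_momentum_step(other_layers, [2.0], other_velocity, lr=0.001)
```

With identical gradients, the freshly initialized top layer moves ten times further per step than the pre-trained layers, which is the purpose of the two learning rates.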

In a second stage we jointly train the convolutional network parameters as well as the compatibility
and shape parameters of the dense CRF arising from the pairwise functions

$$f_{ij}(y_i, y_j, x, w) = \mu(y_i, y_j) \sum_{m=1}^{2} w_m\, k^{(m)}\big(f_i^{(m)}(x) - f_j^{(m)}(x)\big). \qquad (9)$$

Hereby, we employ the Potts potential $\mu(y_i, y_j) = [y_i = y_j]$ and the Gaussian kernels given by

$$k^{(m)} = \exp\Big(-\tfrac{1}{2} \big(f_i^{(m)} - f_j^{(m)}\big)^\top \Sigma_m^{-1} \big(f_i^{(m)} - f_j^{(m)}\big)\Big).$$

As indicated in Eq. (9), we use $M = 2$ kernels, both with diagonal covariance matrix $\Sigma_m$. One
contains as features $f_i(x)$ the two-dimensional pixel positions, the other contains as features
the two-dimensional pixel positions as well as the three color channels. Hence we obtain a total of
nine parameters, i.e., two compatibility parameters $w_1$ and $w_2$ and $2 + 5 = 7$ kernel shape parameters
for the diagonal covariance matrices $\Sigma_m$.
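The pairwise function and its parameter count can be sketched as follows. The weights, inverse variances and pixel features are hypothetical placeholder values; only the structure (a Potts compatibility times a weighted sum of two Gaussian kernels with diagonal covariance) follows the description above.

```python
import math

def gaussian_kernel(f_i, f_j, inv_var):
    """Gaussian kernel with diagonal inverse covariance `inv_var`."""
    return math.exp(-0.5 * sum(v * (a - b) ** 2 for a, b, v in zip(f_i, f_j, inv_var)))

def pairwise_score(y_i, y_j, feats_i, feats_j, weights, inv_vars):
    """Potts compatibility times the weighted sum of the kernels."""
    mu = 1.0 if y_i == y_j else 0.0
    return mu * sum(w * gaussian_kernel(f_i, f_j, iv)
                    for w, f_i, f_j, iv in zip(weights, feats_i, feats_j, inv_vars))

weights = [1.0, 2.0]                               # two compatibility parameters
inv_vars = [[1.0, 1.0], [1.0, 1.0, 1.0, 1.0, 1.0]]  # 2 + 5 kernel shape parameters
num_parameters = len(weights) + sum(len(iv) for iv in inv_vars)

# Two neighboring pixels with identical color: (position) and (position, color).
feats_i = [(0.0, 0.0), (0.0, 0.0, 10.0, 20.0, 30.0)]
feats_j = [(1.0, 0.0), (1.0, 0.0, 10.0, 20.0, 30.0)]
same = pairwise_score(3, 3, feats_i, feats_j, weights, inv_vars)
different = pairwise_score(3, 7, feats_i, feats_j, weights, inv_vars)
```

Pixels that are close in position and color interact strongly when they share a label, and the Potts compatibility switches the interaction off when their labels differ.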

4.2 Results 

As mentioned before, all our results were computed on the validation set of the Pascal VOC dataset. 
This part of the data was neither used for training nor for fine-tuning. 

Unary performance: We first investigate the performance of the first training stage of the proposed
approach, i.e., fine-tuning of the 16 layer DeepNet parameters on the Pascal VOC data. The
validation set accuracy is plotted over the number of iterations in Fig. 4 (a). We observe the
performance to peak at around 4000 iterations with a mean intersection over union measure of 61.476%.
The result reported by [3] for this experiment is 59.80%, i.e., we outperform their unary model by
roughly 1.7 percentage points.

Joint training: Next we illustrate the performance of the second step, i.e., joint training of both
convolutional network parameters and CRF compatibility and shape parameters. In Fig. 4 (b) we
indicate the best obtained unary performance from the first step and visualize the validation and
training set performance over the number of iterations. We observe the results to peak quickly after
around 20 iterations and remain largely stable thereafter.

Details: In Tab. 1 we provide the training and validation set accuracies for the 21 individual
classes. We observe the 'bike' and 'chair' classes to be particularly difficult. For both categories
the validation set performance is roughly half of the training set accuracy.

Comparison to baseline: As provided in Tab. 1, the peak validation set performance of our
approach is 64.060%, which slightly outperforms the separate training result of 63.74% reported by
Chen et al. [3].
Figure 5: Visual results of good predictions.


Visual results: We illustrate visual results of our approach in Fig. 5. Our method successfully
segments an object if it is clearly apparent in the image. Noisy images and objects with many
variations pose challenges to the presented approach, as visualized in Fig. 6. Also, we observe our
learnt parameters to generally over-smooth results while being noisy on the boundaries.


5 Discussion 


We presented a first method that jointly trains convolutional neural networks and fully connected
conditional random fields for semantic image segmentation. To this end we generalize [3] to joint
training. Note that a method along those lines has also recently been made publicly available in
independent work. Whereas the latter combines dense conditional random fields with the
fully convolutional networks presented by Long et al., we employ and modify the 16 layer
DeepNet architecture presented in work by Simonyan and Zisserman.

Ideas along the lines of joint training were discussed within machine learning and computer vision
as early as the 90's in work done by Bridle and Bottou. More recently, several works
incorporate non-linearities into unary potentials but generally assume exact inference to be tractable.
Even more recently, Li and Zemel [20] investigated training with hinge-loss objectives using
non-linear unaries, but the pairwise potentials remain fixed, i.e., there is no joint training. Domke
decomposes the learning objective into logistic regressors, which would be computationally expensive
in our setting. Tompson et al. [32] propose joint training for pose estimation based on a heuristic
approximation which ignores the normalization constant of the model distribution. Joint training of
conditional random fields and deep networks was also discussed recently by [4] for graphical models
in general. Techniques based on convex and non-convex approximations were described for obtaining
marginals in the general non-linear setting.
Figure 6: Failure cases.


6 Conclusion 

We discussed a method for semantic image segmentation that jointly trains convolutional neural
networks and conditional random fields. Our approach combines techniques from deep convolutional
neural networks with variational mean-field approximations from the graphical model literature. We
obtain good results on the challenging Pascal VOC 2012 dataset.

In the future we plan to train our method on larger datasets. Additionally we want to investigate 
training with weakly labeled data. 